Bayesian Logistic Regression Capstone

On this page

  • Group Project Workflow and Contributions
  • Introduction
    • Aims
    • Logical Flow for Analysis
  • Method and Data Preparation
    • Statistical Tool
    • Data pre-processing and cleaning
    • Exploratory data analysis
  • Statistical Methods and Modeling
    • Multiple Logistic regression on survey weighted dataset
    • Multivariate Imputation by Chained Equations
    • Bayesian Logistic Regression
  • Bayesian Logistic Regression procedure
    • Training and testing
  • Posterior Draws
  • Visualization
  • Estimated R²
    • Comparative Visualizations
  • Results
    • Multiple Linear Regression Equation
    • Multivariate Imputation by Chained Equations
    • Convergence & Diagnostics
    • Interpretation
    • Comparing Models
  • Conclusion
  • Discussions
  • Limitations
  • QUESTION for targeted therapy
    • Targeted bmi
    • Implications
    • References

Bayesian Logistic Regression - Application in Probability Prediction of disease (Diabetes)

CapStone Project_2025

Author

Namita Mishra, Autumn Wilcox (Advisor: Dr. Ashraf Cohen)

Published

October 23, 2025

Slides: slides.html


Group Project Workflow and Contributions

  • Autumn Wilcox – Contributed to analytic coding, content draft, structured the project workflow, and collaborated actively via GitHub.
  • Namita Mishra – Developed the project plan, content draft, analytic coding, and coordinated commits and collaboration with the group on GitHub.

Introduction

Diabetes mellitus (DM) is a major public health concern closely associated with factors such as obesity, age, race, and gender. Identifying these associated risk factors is essential for targeted interventions D’Angelo and Ran (2025). Traditional logistic regression, which estimates the association between risk factors and outcomes, is insufficient for analyzing complex healthcare data (DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments) Zeger et al. (2020). Classical maximum likelihood estimation (MLE) yields unstable results in samples that are small, have missing data, or present quasi- or complete separation.
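The separation problem can be illustrated with a toy fit (invented data, not NHANES; `glm()` is expected to warn that fitted probabilities of 0 or 1 occurred):

```r
# Toy example of complete separation: x perfectly predicts y, so the MLE diverges
y <- c(0, 0, 0, 1, 1, 1)
x <- c(1, 2, 3, 4, 5, 6)
fit <- glm(y ~ x, family = binomial())  # warns: fitted probabilities 0 or 1
coef(fit)[["x"]]                        # very large, unstable slope estimate
```

Firth-penalized likelihood (the logistf package loaded below) or Bayesian priors are standard remedies for this instability.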

Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) allow analysis of multivariate longitudinal healthcare data with repeated measures within individuals and individuals nested in a population. By integrating prior knowledge and including exogenous (e.g., age, clinical history) and endogenous (e.g., current treatment) covariates, Bayesian models provide posterior distributions and risk predictions for conditions such as pneumonia, prostate cancer, and mental disorders. Parametric assumptions remain a limitation of these models.

Chatzimichail and Hatjimihail (2023) show that parametric Bayesian models (with fixed parameters) often underperform compared to nonparametric models, which do not assume a prior distribution. Posterior probabilities from Bayesian approaches improve disease classification and better capture heterogeneity in skewed, bimodal, or multimodal data distributions. Bayesian nonparametric models are flexible and robust, integrating multiple diagnostic tests and priors to enhance accuracy and precision, though reliance on prior information and restricted access to resources can limit applicability. Combining Bayesian methods with other statistical or computational approaches helps address systemic biases, incomplete data, and non-representative datasets.

The Bayesian framework described by R. van de Schoot et al. (2021) highlights the role of priors, data modeling, inference, posterior sampling, variational inference, and variable selection. Proper variable selection mitigates multicollinearity, overfitting, and limited sampling, improving predictive performance. Priors can be informative, weakly informative, or diffuse, and can be elicited from expert opinion, generic knowledge, or data-driven methods. Sensitivity analysis evaluates the alignment of priors with likelihoods, while MCMC simulations (e.g., brms, blavaan in R) empirically estimate posterior distributions. Spatial and temporal Bayesian models have applications in large-scale cancer genomics, identifying molecular interactions, mutational signatures, patient stratification, and cancer evolution, though temporal autocorrelation and subjective prior elicitation can be limiting.

Bayesian normal linear regression has been applied in metrology for instrument calibration using conjugate Normal–Inverse-Gamma priors Klauenberg et al. (2015). Hierarchical priors add flexibility by modeling uncertainty across multiple levels, improving robustness and interpretability. Bayesian hierarchical/meta-analytic linear regression incorporates both exchangeable and unexchangeable prior information, addressing multiple testing challenges, small sample sizes, and complex relationships among regression parameters across studies Leeuw and Klugkist (2012).

Sequential clinical reasoning models Liu et al. (2013) demonstrate screening by adding predictors stepwise: (1) demographics, (2) metabolic components, and (3) conventional risk factors, incorporating priors and mimicking clinical evaluation. This approach captures ecological heterogeneity and improves baseline risk estimation, though interactions between predictors and external cross-validation remain limitations.

Bayesian multiple imputation with logistic regression addresses missing data in clinical research Austin et al. (2021) by classifying missing values (e.g., patient refusal, loss to follow-up, mechanical errors) as MAR, MNAR, or MCAR. Multiple imputation generates plausible values across datasets and pools results for reliable classification of patient health status and mortality.

Aims

The present study performs Bayesian logistic regression to predict diabetes status and evaluate the associations between diabetes and predictors (body mass index (BMI), age (≥20 years), gender, and race). The study analyzes a retrospective dataset (2013–2014 NHANES survey data). It is based on a complex sampling design, characterized by stratification, clustering, and oversampling of specific population subgroups, rather than uniform random sampling. A Bayesian analytical approach addresses challenges posed by dataset anomalies such as missing data, complete-case analysis, and separation that limit the efficiency and reliability of traditional logistic regression in predicting health outcomes.

Logical Flow for Analysis

  1. Data Preparation: Preparation and analysis of the observed or imputed dataset (e.g., adult_imp1.csv) to handle missing values and ensure completeness.
  2. Train-Test Split: Splitting the dataset into training (to fit the model) and testing (to evaluate model performance) sets to ensure the model is generalizable and avoid overfitting.
  3. Frequentist Approach: Fit classical regression or machine learning models on the complete dataset to obtain point estimates (e.g., coefficients, odds ratios) and confidence intervals.
  4. Bayesian Approach: Fit a Bayesian model on the imputed dataset to obtain posterior distributions (posterior draws) for parameters and quantify uncertainty. Use posterior draws to generate posterior predictive distributions on the complete dataset.
  5. Targeted Intervention Analysis: Compare the two models; use the Bayesian model to simulate or assess interventions; identify modifiable risk factors (e.g., BMI, lifestyle factors); predict how change in a risk factor affects outcomes (e.g., diabetes risk); and quantify uncertainty in intervention effects using posterior predictive distributions.
  6. Inference & Decision-Making: Combine insights from both approaches to guide data-driven decisions or public health recommendations. Frequentist results provide point estimates, while Bayesian results provide full uncertainty quantification.
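The train-test split step in the workflow above can be sketched in R (a toy data frame stands in for the analytic dataset; names here are illustrative, not the study's):

```r
# Hypothetical 70/30 train-test split (toy stand-in for the analytic dataset)
set.seed(2025)
toy <- data.frame(id = 1:100, diabetes_dx = rbinom(100, 1, 0.1))
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.7 * nrow(toy)))
train <- toy[train_idx, ]   # used to fit the model
test  <- toy[-train_idx, ]  # held out to evaluate performance
c(train = nrow(train), test = nrow(test))  # 70 / 30
```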

Method and Data Preparation

  • The study performs Bayesian logistic regression to predict diabetes status and estimate the association of key predictors (body mass index (BMI), age, gender, and race) using the 2013–2014 NHANES dataset.
  • The complex survey design of NHANES, with stratification, clustering, and oversampling was accounted for to ensure accurate estimation.
  • Posterior predictive distributions generated quantify the probability of diabetes for each individual, providing a flexible and robust framework for uncertainty quantification.
  • Model predictions are also compared against known population prevalence, enabling validation and assessment of model realism.
  • By integrating prior information with the observed data, the model identifies individuals at high risk for diabetes and informs targeted public health interventions, demonstrating the utility of Bayesian methods in analyzing complex, multivariate health data.

Statistical Tool

  • R, R packages and libraries are used to import data and perform data wrangling and analysis.

Data source

  • NHANES 2-year data (2013–2014), a cross-sectional weighted dataset Center for Health Statistics (1999).
Code
# Load packages (install nhanesA and gt once if missing)
options(repos = c(CRAN = "https://cloud.r-project.org"))
for (p in c("nhanesA", "gt")) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p, dependencies = TRUE)
}

library(nhanesA)
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(Hmisc)
library(dplyr)     # re-attach after Hmisc so dplyr verbs take precedence
library(tidyr)
library(forcats)
library(ggplot2)
library(classpackage)
library(janitor)
library(gt)
library(survey)
library(DataExplorer)
library(logistf)

Data pre-processing and cleaning

  • Three datasets (demographics, exam, questionnaire) in .XPT format are imported into R (haven package). After selecting variables of interest, the three weighted datasets are merged on participant ID (SEQN) into a single dataframe.
  1. Response Variable: Binary
  • Type 2 / diagnosed diabetes (excluding gestational diabetes): “Doctor told you have diabetes?” (DIQ010), combined with DIQ050, a secondary variable describing treatment status (insulin use), to exclude gestational cases.
  2. Predictor Variable
  • Body Mass Index (factor, 4 levels, analyzed after standardization):
    o Underweight (<5th percentile)
    o Normal (5th–<85th percentile)
    o Overweight (85th–<95th percentile)
    o Obese (≥95th percentile)
    o Missing values are kept as-is, since categorization provides clinically interpretable groups.
  3. Covariates:
    1. Gender (factor, 2 levels): Male, Female
    2. Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race - Including Multi-Racial
    3. Age (numeric, continuous)

The merged dataset is cleaned and exploratory data analysis conducted, with results and visualizations reported for 10175 observations and 10 variables, where race is categorized in 5 levels, age ranges 0–80 years, gender is male/female, diabetes is grouped from DIQ010 and DIQ050, and BMI is continuous.

Code
# weighted means of each variable                       
str(merged_data)
'data.frame':   10175 obs. of  10 variables:
 $ SEQN    : num  73557 73558 73559 73560 73561 ...
 $ RIDAGEYR: num  69 54 72 9 73 56 0 61 42 56 ...
 $ RIAGENDR: Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 1 2 ...
 $ RIDRETH1: Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 2 3 ...
 $ SDMVPSU : num  1 1 1 2 2 1 1 1 2 1 ...
 $ SDMVSTRA: num  112 108 109 109 116 111 105 114 106 112 ...
 $ WTMEC2YR: num  13481 24472 57193 55767 65542 ...
 $ BMXBMI  : num  26.7 28.6 28.9 17.1 19.7 41.7 NA 35.7 NA 26.5 ...
 $ DIQ010  : Factor w/ 5 levels "Yes","No","Borderline",..: 1 1 1 2 2 2 NA 2 2 2 ...
 $ DIQ050  : Factor w/ 4 levels "Yes","No","Refused",..: 1 1 1 2 2 2 NA 2 2 2 ...
Code
plot_str(merged_data)
head(merged_data)
   SEQN RIDAGEYR RIAGENDR           RIDRETH1 SDMVPSU SDMVSTRA WTMEC2YR BMXBMI
1 73557       69     Male Non-Hispanic Black       1      112 13481.04   26.7
2 73558       54     Male Non-Hispanic White       1      108 24471.77   28.6
3 73559       72     Male Non-Hispanic White       1      109 57193.29   28.9
4 73560        9     Male Non-Hispanic White       2      109 55766.51   17.1
5 73561       73   Female Non-Hispanic White       2      116 65541.87   19.7
6 73562       56     Male   Mexican American       1      111 25344.99   41.7
  DIQ010 DIQ050
1    Yes    Yes
2    Yes    Yes
3    Yes    Yes
4     No     No
5     No     No
6     No     No
Code
summary(merged_data)
      SEQN          RIDAGEYR       RIAGENDR   
 Min.   :73557   Min.   : 0.00   Male  :5003  
 1st Qu.:76100   1st Qu.:10.00   Female:5172  
 Median :78644   Median :26.00                
 Mean   :78644   Mean   :31.48                
 3rd Qu.:81188   3rd Qu.:52.00                
 Max.   :83731   Max.   :80.00                
                                              
                                RIDRETH1       SDMVPSU         SDMVSTRA    
 Mexican American                   :1730   Min.   :1.000   Min.   :104.0  
 Other Hispanic                     : 960   1st Qu.:1.000   1st Qu.:107.0  
 Non-Hispanic White                 :3674   Median :1.000   Median :111.0  
 Non-Hispanic Black                 :2267   Mean   :1.484   Mean   :110.9  
 Other Race - Including Multi-Racial:1544   3rd Qu.:2.000   3rd Qu.:115.0  
                                            Max.   :2.000   Max.   :118.0  
                                                                           
    WTMEC2YR          BMXBMI             DIQ010            DIQ050    
 Min.   :     0   Min.   :12.10   Yes       : 737   Yes       : 220  
 1st Qu.: 12562   1st Qu.:19.70   No        :8841   No        :9545  
 Median : 20175   Median :24.70   Borderline: 185   Refused   :   1  
 Mean   : 30585   Mean   :25.68   Refused   :   1   Don't know:   2  
 3rd Qu.: 36748   3rd Qu.:30.20   Don't know:   5   NA's      : 407  
 Max.   :171395   Max.   :82.90   NA's      : 406                    
                  NA's   :1120                                       
Code
library(dplyr)
library(knitr)

plot_intro(merged_data, title="Figure 1 (Merged dataset). Structure of variables and missing observations.")

Code
plot_missing(merged_data, title="Figure 2(Merged dataset). Breakdown of missing observations.")

Code
# print(glimpse(merged_data))
print(table(merged_data$BMDBMIC, useNA = "ifany"))
< table of extent 0 >
Code
print(table(merged_data$DIQ010,  useNA = "ifany"))

       Yes         No Borderline    Refused Don't know       <NA> 
       737       8841        185          1          5        406 
Code
 #  ---- Coercion helpers (handle labelled/character) ----
to_num <- function(x) {
  if (is.numeric(x)) return(x)
  xc <- as.character(x)
  n <- suppressWarnings(readr::parse_number(xc))
  if (mean(is.na(n)) > 0.80) {
    xlow <- tolower(trimws(xc))
    n <- dplyr::case_when(
      xlow %in% c("1","yes","yes, told") ~ 1,
      xlow %in% c("2","no","no, not told") ~ 2,
      xlow %in% c("3","borderline") ~ 3,
      xlow %in% c("7","refused") ~ 7,
      xlow %in% c("9","don't know","dont know","unknown") ~ 9,
      TRUE ~ NA_real_
    )
  }
  as.numeric(n)
}

merged_data <- merged_data %>%
  mutate(
    DIQ010   = to_num(DIQ010),
    DIQ050   = to_num(if (!"DIQ050" %in% names(.)) NA_real_ else DIQ050),
    BMXBMI   = suppressWarnings(as.numeric(BMXBMI)),
    RIDAGEYR = suppressWarnings(as.numeric(RIDAGEYR)),
    RIAGENDR = suppressWarnings(as.numeric(RIAGENDR)),
    RIDRETH1 = suppressWarnings(as.numeric(RIDRETH1)),
    SDMVPSU  = suppressWarnings(as.numeric(SDMVPSU)),
    SDMVSTRA = suppressWarnings(as.numeric(SDMVSTRA)),
    WTMEC2YR = suppressWarnings(as.numeric(WTMEC2YR))
  )

# ---- Diagnostics BEFORE save ----
cat("DIQ010 counts BEFORE save:\n")
DIQ010 counts BEFORE save:
Code
print(table(merged_data$DIQ010, useNA = "ifany"))

   1    2    3    7    9 <NA> 
 737 8841  185    1    5  406 
Code
cat("Count with DIQ010 in {1,2}:", sum(merged_data$DIQ010 %in% c(1,2), na.rm = TRUE), "\n")
Count with DIQ010 in {1,2}: 9578 
Code
# ---- Save to file for reuse ----
dir.create("data", showWarnings = FALSE, recursive = TRUE)
saveRDS(merged_data, "data/merged_2013_2014.rds")
message("Saved: data/merged_2013_2014.rds")

Exploratory data analysis

  • Using library(survey) we calculated weighted means and SDs of all the variables. BMI and age were standardized.
  • Age was recoded to restrict the analysis to adults (≥20 years), giving an age range of 20–80 years.
  • BMI is recoded into categories: <18.5, 18.5–<25, 25–<30, 30–<35, 35–<40, ≥40.
  • Ethnicity is recoded as “Mexican American” = “1”, “Other Hispanic” = “2”, “NH White” = “3”, “NH Black” = “4”, “Other/Multi” = “5”.
  • Since the special codes are not random, they cannot simply be dropped; ignoring informative missingness (MAR or MNAR) could introduce bias.

We transformed the special codes (3, 7, 9) to NA and included all NAs in the analysis. Visualization of missing data is presented below.

A final analytic dataset (‘adult’) is created with “NH White” and “Male” as the reference groups for analysis.

Code
## 
# ---------------- Basic Exploration (adults) ----------------

# Keep adults only and build analysis variables
adult <- merged_data %>%
  dplyr::filter(RIDAGEYR >= 20) %>%
  dplyr::transmute(
    # --- keep survey design variables so svydesign() can see them ---
    SDMVPSU, SDMVSTRA, WTMEC2YR,

    # --- outcome: DIQ010 (1 yes, 2 no; 3/7/9 -> NA) ---
    diabetes_dx = dplyr::case_when(
      DIQ010 == 1 ~ 1,
      DIQ010 == 2 ~ 0,
      DIQ010 %in% c(3, 7, 9) ~ NA_real_,
      TRUE ~ NA_real_
    ),

    # --- predictors (raw) ---
    bmi  = BMXBMI,
    age  = RIDAGEYR,

    # sex (1=Male, 2=Female)
    sex  = forcats::fct_recode(factor(RIAGENDR), Male = "1", Female = "2"),

    # race (5-level)
    race = forcats::fct_recode(
      factor(RIDRETH1),
      "Mexican American" = "1",
      "Other Hispanic"   = "2",
      "NH White"         = "3",
      "NH Black"         = "4",
      "Other/Multi"      = "5"
    ),

    # keep DIQ050 so we can safely reference it (may be absent/NA in some rows)
    
    DIQ050 = DIQ050
  ) %>%
  # standardize continuous predictors
  dplyr::mutate(
    age_c = as.numeric(scale(age)),
    bmi_c = as.numeric(scale(bmi)),
    bmi_cat = cut(
      bmi,
      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
      labels = c("<18.5","18.5–<25","25–<30","30–<35","35–<40","≥40"),
      right = FALSE
    )
  ) %>%
  # adjust outcome: if female & DIQ050==1 ("only when pregnant"), set to 0 (not diabetes)
  dplyr::mutate(
    diabetes_dx = ifelse(sex == "Female" & !is.na(DIQ050) & DIQ050 == 1, 0, diabetes_dx)
  )

# Make NH White the reference level for race (clearer interpretation)
adult <- adult %>%
  dplyr::mutate(
    race = forcats::fct_relevel(race, "NH White")
  )

# --- sanity checks ---
cat("Adults n =", nrow(adult), "\n")
Adults n = 5769 
Code
glimpse(adult)
Rows: 5,769
Columns: 12
$ SDMVPSU     <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ SDMVSTRA    <dbl> 112, 108, 109, 116, 111, 114, 106, 112, 112, 113, 116, 114…
$ WTMEC2YR    <dbl> 13481.04, 24471.77, 57193.29, 65541.87, 25344.99, 61758.65…
$ diabetes_dx <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ bmi         <dbl> 26.7, 28.6, 28.9, 19.7, 41.7, 35.7, NA, 26.5, 22.0, 20.3, …
$ age         <dbl> 69, 54, 72, 73, 56, 61, 42, 56, 65, 26, 76, 33, 32, 38, 50…
$ sex         <fct> Male, Male, Male, Female, Male, Female, Male, Female, Male…
$ race        <fct> NH Black, NH White, NH White, NH White, Mexican American, …
$ DIQ050      <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ age_c       <dbl> 1.13241831, 0.27835981, 1.30323001, 1.36016725, 0.39223428…
$ bmi_c       <dbl> -0.33588609, -0.07028101, -0.02834336, -1.31443114, 1.7609…
$ bmi_cat     <fct> 25–<30, 25–<30, 25–<30, 18.5–<25, ≥40, 35–<40, NA, 25–<30,…

Exploratory Description and Visualization of the Adult dataset (>20 years)

  • Observations: 5769
  • Survey design: SDMVPSU, SDMVSTRA, WTMEC2YR
  • Outcome: diabetes_dx (numeric 0/1)
  • Covariates: bmi, age, sex, race, DIQ050
  • Centered covariates: age_c, bmi_c
  • BMI categories: bmi_cat

NHANES is a national survey based on a complex sampling design, oversampling certain groups (e.g., minorities, older adults) to ensure representation. It uses multistage sampling to represent the U.S. population, so we apply sampling weights, strata, and PSUs (primary sampling units) for valid estimates.

We use the survey design in regression analysis to avoid:

  • biased prevalence estimates (e.g., mean BMI or diabetes %)
  • underestimation of standard errors
  • incorrect inference for population-level parameters.

Code
# data exploration

print(table(adult$diabetes_dx, useNA = "ifany"))

   0    1 <NA> 
4974  618  177 
Code
print(table(adult$sex, useNA = "ifany"))

  Male Female 
  2758   3011 
Code
print(table(adult$race, useNA = "ifany"))

        NH White Mexican American   Other Hispanic         NH Black 
            2472              767              508             1177 
     Other/Multi 
             845 
Code
if (sum(!is.na(adult$diabetes_dx)) == 0) {
  stop("Too few non-missing outcomes for modeling (n = 0). Check DIQ010 upstream.")
}

# (optional plots omitted for brevity)

# save for downstream
if (!dir.exists("data")) dir.create("data", recursive = TRUE)
saveRDS(adult, "data/adult_cleaned_2013_2014.rds")
Code
ggplot(adult, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "white") +
  labs(
    title = "Distribution of Age >20 years",
    x = "Age (years)",
    y = "Count"
  ) +
  theme_minimal()

Code
ggplot(adult, aes(factor(diabetes_dx))) +
  geom_bar(fill = "steelblue") +
  labs(title="Diabetes Outcome Distribution in >20 years age group", x="diabetes_dx (0=No, 1=Yes)", y="Count")

Code
ggplot(adult, aes(factor(bmi_cat))) +
  geom_bar(fill = "steelblue") +
  labs(title="Diabetes Outcome Distribution by BMI in >20 years age group", x="bmi_cat")

Code
ggplot(adult, aes(x = factor(diabetes_dx), y = bmi)) +
  geom_boxplot(fill = "skyblue") +
  labs(
    title = "BMI Distribution by Diabetes Diagnosis in >20 years age group",
    x = "Diabetes Diagnosis (0 = No, 1 = Yes)",
    y = "BMI"
  ) +
  theme_minimal()

Code
# plots for adult data bmi categories and race categories

ggplot(adult, aes(x = factor(race), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes Diagnosis by Race in >20 years age group",
    x = "Race/Ethnicity",
    y = "Count",
    fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
ggplot(adult, aes(x = factor(bmi_cat), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes Diagnosis by BMI in >20 years age group",
    x = "BMI",
    y = "Count",
    fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Statistical Methods and Modeling

Multiple Logistic regression on survey weighted dataset

We conducted a frequentist multiple logistic regression on the survey-weighted dataset, for complete-case analysis and data exploration.

Multivariate Imputation by Chained Equations

  • We conducted MICE to manage missing data as an alternative to the Bayesian approach Buuren and Groothuis-Oudshoorn (2011).
  • Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples n > 400, even for high percentages (75%) of missing data in one variable Van Buuren and Van Buuren (2012).
  • Multiple Imputation (MI) can be performed using the mice package in R.
  • mice iteratively imputes missing values of one variable at a time, using regression models based on the other variables in the dataset.
  • In the chained process, each imputed variable becomes a predictor for the subsequent imputation, and the entire process is repeated multiple times to create several complete datasets, each reflecting different possibilities for the missing data.
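A toy illustration of the chained-equation idea (invented data, assuming only that the mice package is installed; the study's actual imputation appears in the modeling code below):

```r
# Chained-equation imputation on a tiny invented dataset
library(mice)
set.seed(42)
toy <- data.frame(
  age = c(25, 34, NA, 51, 62, 47, NA, 58, 30, 44),
  bmi = c(22.1, NA, 27.5, 31.0, NA, 29.3, 24.8, 33.2, 21.7, 26.4)
)
# m = 3 completed datasets; pmm = predictive mean matching for both columns
imp <- mice(toy, m = 3, maxit = 5, method = "pmm", printFlag = FALSE)
completed <- complete(imp, 1)  # first of the three completed datasets
anyNA(completed)               # FALSE: every gap has a plausible value
```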

Bayesian Logistic Regression

  • We conduct Bayesian logistic regression to estimate the association between BMI, age, sex, and race/ethnicity and to predict doctor-diagnosed diabetes (DIQ010).
  • Bayesian statistics is about updating beliefs with evidence: Posterior ∝ Likelihood × Prior.
  • Prior (p(θ)): the initial belief about a parameter before seeing the data.
  • Likelihood (p(y|θ)): how probable the observed data are given the parameters, derived from the model (e.g., the logistic regression likelihood).
  • Posterior (p(θ|y)): the updated belief about the parameter after seeing the data.
  • We selected Bayesian logistic regression since our study's response variable is a binary outcome (Diabetes: yes/no).
  • Bayesian logistic regression applies Bayes' rule to a binomial probability model and predicts the probability of the disease outcome.
  • The model relates a linear predictor of the covariates (Age, Race, BMI, Gender) to the outcome response variable (Diabetes).
  • The model assumes observations are independent.
  • Regression on a discrete variable (0 or 1) is a Bernoulli probability model that classifies a categorical response variable - here, predicting Diabetes.
  • The logit link maps the linear predictor to probabilities for the response variable.
  • Prior: We used a student_t(3, 0, 10) prior R. V. D. Schoot et al. (2013). In student_t(ν, μ, σ):
    o ν = 3: degrees of freedom (controls tail heaviness)
    o μ = 0: location (center, like the mean)
    o σ = 10: scale (spread, like a standard deviation)
    student_t(3, 0, 10) is centered at 0, allows large values (± several times 10), and has heavy tails (ν = 3), accommodating outliers or unexpectedly large parameter values. This helps the model remain stable with small datasets, highly correlated predictors, or potential outliers in the data.
  • In Bayesian statistics, every unknown parameter (e.g., a regression coefficient, mean, or variance) is treated as a random variable with a probability distribution that reflects uncertainty.
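The updating rule Posterior ∝ Likelihood × Prior can be made concrete with a one-parameter grid approximation (toy prevalence numbers chosen for illustration, not study estimates):

```r
# Grid sketch of Posterior ∝ Likelihood × Prior for a disease prevalence p
p_grid <- seq(0, 1, length.out = 1001)
prior  <- dbeta(p_grid, 2, 18)                   # weak prior near 10% prevalence
lik    <- dbinom(12, size = 100, prob = p_grid)  # 12 cases observed in 100
post   <- prior * lik                            # unnormalized posterior
post   <- post / sum(post)                       # normalize over the grid
p_grid[which.max(post)]                          # posterior mode, ~0.11
```

The data pull the prior toward the observed rate; with a Beta prior and binomial likelihood the exact posterior here is Beta(14, 106), and the grid mode matches its analytic mode 13/118.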

Summary of the Bayesian regression is presented under the following headings:

  • Posterior Predictive Probabilities
  • Posterior Mean, Median, Credible Intervals
  • Posterior Probability (Outcome = 1)
  • Comparison with External Prevalence (population prevalence)
  • Posterior Model Fit Metrics
  • Prior versus Posterior Coefficient Distributions
  • Posterior Predictive Checks
  • Uncertainty Quantification

Bayesian Logistic Regression Model Equation

\[ \text{logit}(P(Y_i=1)) = \beta_0 + \beta_1 \cdot Age_i + \beta_2 \cdot BMI_i + \beta_3 \cdot Race_i + \beta_4 \cdot Gender_i \]

For comparison, the linear regression equation:

\[ y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i \]

Diagnostics for Bayesian Logistic Regression

  • Trace Plots for Markov Chain Monte Carlo Convergence
  • Autocorrelation Plots
  • Posterior Predictive Checks
  • Residual Analysis
  • Prior Sensitivity Analysis
  • Model Fit Assessment
  • Posterior Predictive Probability Plots
  • Posterior Interval Coverage Evaluation
  • Convergence Diagnostics across Chains
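The first two diagnostics above can be sketched with bayesplot on a simulated draws array (a stand-in; in the analysis the draws come from the fitted brms model):

```r
# Trace and autocorrelation plots on a simulated iterations x chains x parameters array
library(bayesplot)
set.seed(1)
draws <- array(rnorm(500 * 4 * 2), dim = c(500, 4, 2),
               dimnames = list(NULL, NULL, c("b_age_c", "b_bmi_c")))
mcmc_trace(draws)  # chains should overlap and look stationary ("fuzzy caterpillar")
mcmc_acf(draws)    # autocorrelation should decay quickly for efficient sampling
```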
Code
# Modeling

library(broom)
library(mice)
library(brms)
library(posterior)
library(bayesplot)
library(knitr)

# --- Guardrails for modeling ---
n_outcome <- sum(!is.na(adult$diabetes_dx))
if (n_outcome == 0) stop("Too few non-missing outcomes for modeling. n = 0")

# Ensure factors and >=2 observed levels among complete outcomes
adult <- adult %>%
  dplyr::mutate(
    sex  = if (!is.factor(sex))  factor(sex)  else sex,
    race = if (!is.factor(race)) factor(race) else race
  )

if (nlevels(droplevels(adult$sex[!is.na(adult$diabetes_dx)]))  < 2)
  stop("sex has <2 observed levels after filtering; check data availability.")
if (nlevels(droplevels(adult$race[!is.na(adult$diabetes_dx)])) < 2)
  stop("race has <2 observed levels after filtering; check Data Prep.")

# ---- Survey-weighted complete-case analysis ----
# Define the adult NHANES design (strata, PSUs, MEC 2-year exam weights);
# this is the nhanes_design_adult object subset below
nhanes_design_adult <- survey::svydesign(
  ids = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
  nest = TRUE, data = adult
)

# Build a logical filter on the original adult data (same length as design$data)
keep_cc <- with(
  adult,
  !is.na(diabetes_dx) & !is.na(age_c) & !is.na(bmi_c) &
  !is.na(sex) & !is.na(race)
)

# Subset the survey design using the logical vector (same length as original)
des_cc <- subset(nhanes_design_adult, keep_cc)

# Corresponding complete-case data (optional)
cc <- adult[keep_cc, ] |> droplevels()
cat("\nComplete-case N for survey-weighted model:", nrow(cc), "\n")

Complete-case N for survey-weighted model: 5349 
Code
print(table(cc$race))

        NH White Mexican American   Other Hispanic         NH Black 
            2293              713              470             1101 
     Other/Multi 
             772 
Code
print(table(cc$diabetes_dx))

   0    1 
4752  597 
Code
print(table(cc$sex))

  Male Female 
  2551   2798 
Code
form_cc <- diabetes_dx ~ age_c + bmi_c + sex + race
svy_fit <- survey::svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
summary(svy_fit)

Call:
svyglm(formula = form_cc, design = des_cc, family = quasibinomial())

Survey design:
subset(nhanes_design_adult, keep_cc)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -2.67143    0.11935 -22.383 1.68e-08 ***
age_c                 1.10833    0.05042  21.981 1.94e-08 ***
bmi_c                 0.63412    0.05713  11.099 3.88e-06 ***
sexFemale            -0.63844    0.10926  -5.843 0.000386 ***
raceMexican American  0.71091    0.13681   5.196 0.000826 ***
raceOther Hispanic    0.46469    0.13474   3.449 0.008712 ** 
raceNH Black          0.51221    0.15754   3.251 0.011677 *  
raceOther/Multi       0.84460    0.17756   4.757 0.001433 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.8455444)

Number of Fisher Scoring iterations: 6
Code
## plot(residuals(svy_fit, type='deviance'))

# Survey-weighted OR table (no intercept)
svy_or <- broom::tidy(svy_fit, conf.int = TRUE) %>%
  dplyr::mutate(OR = exp(estimate), LCL = exp(conf.low), UCL = exp(conf.high)) %>%
  dplyr::select(term, OR, LCL, UCL, p.value) %>%
  dplyr::filter(term != "(Intercept)")
knitr::kable(svy_or, caption = "Survey-weighted odds ratios (per 1 SD)")
Survey-weighted odds ratios (per 1 SD)
term OR LCL UCL p.value
age_c 3.0292807 2.6967690 3.4027912 0.0000000
bmi_c 1.8853571 1.6526296 2.1508579 0.0000039
sexFemale 0.5281132 0.4104905 0.6794397 0.0003857
raceMexican American 2.0358434 1.4850041 2.7910081 0.0008262
raceOther Hispanic 1.5915182 1.1664529 2.1714810 0.0087119
raceNH Black 1.6689718 1.1605895 2.4000450 0.0116773
raceOther/Multi 2.3270527 1.5451752 3.5045697 0.0014331

The survey-weighted odds ratios (per 1 SD) are presented above. The deviance residual plot showed no systematic pattern.
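As a reading aid, each odds ratio in the table is simply the exponentiated coefficient; for example, using the age coefficient from the model summary above:

```r
# Converting a log-odds coefficient to an odds ratio (value from the table above)
beta_age <- 1.10833  # survey-weighted log-odds per 1 SD increase in age
exp(beta_age)        # ~3.03: odds of diagnosed diabetes triple per 1 SD of age
```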

MICE is then performed, and the model summary pooled across the imputed datasets presents odds ratios.

Code
# ----- Multiple Imputation (predictors only) 
mi_dat <- adult %>%
  dplyr::select(diabetes_dx, age, bmi, sex, race, WTMEC2YR, SDMVPSU, SDMVSTRA)

meth <- mice::make.method(mi_dat)
pred <- mice::make.predictorMatrix(mi_dat)

# Do not impute outcome
meth["diabetes_dx"] <- ""
pred["diabetes_dx", ] <- 0
pred[,"diabetes_dx"] <- 1

# Imputation methods
meth["age"]  <- "norm"
meth["bmi"]  <- "pmm"
meth["sex"]  <- "polyreg"
meth["race"] <- "polyreg"

# Survey design vars as auxiliaries only
meth[c("WTMEC2YR","SDMVPSU","SDMVSTRA")] <- ""
pred[, c("WTMEC2YR","SDMVPSU","SDMVSTRA")] <- 1

glimpse(mi_dat)
Rows: 5,769
Columns: 8
$ diabetes_dx <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ age         <dbl> 69, 54, 72, 73, 56, 61, 42, 56, 65, 26, 76, 33, 32, 38, 50…
$ bmi         <dbl> 26.7, 28.6, 28.9, 19.7, 41.7, 35.7, NA, 26.5, 22.0, 20.3, …
$ sex         <fct> Male, Male, Male, Female, Male, Female, Male, Female, Male…
$ race        <fct> NH Black, NH White, NH White, NH White, Mexican American, …
$ WTMEC2YR    <dbl> 13481.04, 24471.77, 57193.29, 65541.87, 25344.99, 61758.65…
$ SDMVPSU     <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ SDMVSTRA    <dbl> 112, 108, 109, 116, 111, 114, 106, 112, 112, 113, 116, 114…
Code
imp <- mice::mice(mi_dat, m = 5, method = meth, predictorMatrix = pred, seed = 123)

 iter imp variable
  1   1  bmi
  1   2  bmi
  1   3  bmi
  1   4  bmi
  1   5  bmi
  2   1  bmi
  2   2  bmi
  2   3  bmi
  2   4  bmi
  2   5  bmi
  3   1  bmi
  3   2  bmi
  3   3  bmi
  3   4  bmi
  3   5  bmi
  4   1  bmi
  4   2  bmi
  4   3  bmi
  4   4  bmi
  4   5  bmi
  5   1  bmi
  5   2  bmi
  5   3  bmi
  5   4  bmi
  5   5  bmi
Code
fit_mi <- with(imp, {
  age_c <- as.numeric(scale(age))
  bmi_c <- as.numeric(scale(bmi))
  glm(diabetes_dx ~ age_c + bmi_c + sex + race, family = binomial())
})
pool_mi <- pool(fit_mi)
summary(pool_mi)
                  term   estimate  std.error  statistic       df       p.value
1          (Intercept) -2.6895645 0.09941301 -27.054453 5566.204 1.486581e-151
2                age_c  1.0660265 0.05594733  19.054108 5520.446  1.911564e-78
3                bmi_c  0.5468538 0.04473386  12.224604 5148.557  6.751227e-34
4            sexFemale -0.6178297 0.09379129  -6.587282 5551.660  4.892566e-11
5 raceMexican American  0.8877355 0.13750463   6.456041 5472.583  1.167455e-10
6   raceOther Hispanic  0.5606621 0.17485537   3.206433 5573.987  1.351505e-03
7         raceNH Black  0.6809629 0.11981185   5.683602 5576.734  1.385727e-08
8      raceOther/Multi  0.7476406 0.15300663   4.886328 4749.963  1.061140e-06
Code
## table 

mi_or <- summary(pool_mi, conf.int = TRUE, exponentiate = TRUE) %>%
  dplyr::rename(
    term = term, OR = estimate, LCL = `2.5 %`, UCL = `97.5 %`, p.value = p.value
  ) %>%
  dplyr::filter(term != "(Intercept)")
knitr::kable(mi_or, caption = "MI pooled odds ratios (per 1 SD)")
MI pooled odds ratios (per 1 SD)
term OR std.error statistic df p.value LCL UCL conf.low conf.high
2 age_c 2.9038183 0.0559473 19.054108 5520.446 0.0000000 2.6021752 3.2404277 2.6021752 3.2404277
3 bmi_c 1.7278084 0.0447339 12.224604 5148.557 0.0000000 1.5827382 1.8861754 1.5827382 1.8861754
4 sexFemale 0.5391132 0.0937913 -6.587282 5551.660 0.0000000 0.4485669 0.6479368 0.4485669 0.6479368
5 raceMexican American 2.4296216 0.1375046 6.456041 5472.583 0.0000000 1.8555327 3.1813298 1.8555327 3.1813298
6 raceOther Hispanic 1.7518320 0.1748554 3.206433 5573.987 0.0013515 1.2434346 2.4680953 1.2434346 2.4680953
7 raceNH Black 1.9757793 0.1198118 5.683602 5576.734 0.0000000 1.5621842 2.4988753 1.5621842 2.4988753
8 raceOther/Multi 2.1120110 0.1530066 4.886328 4749.963 0.0000011 1.5646727 2.8508138 1.5646727 2.8508138

Bayesian Logistic Regression procedure

Bayesian logistic regression is performed on the first completed dataset from the imputation (adult_imp1); the Bayesian model summary statistics, visualizations, and summary tables follow.

Code
library(gt)

# Bayesian Logistic Regression (formula weights) 
adult_imp1 <- complete(imp, 1) %>%
  dplyr::mutate(
    age_c  = as.numeric(scale(age)),
    bmi_c  = as.numeric(scale(bmi)),
    wt_norm = WTMEC2YR / mean(WTMEC2YR, na.rm = TRUE),
    # ensure factor refs match survey/mice:
    race = forcats::fct_relevel(race, "NH White"),
    sex  = forcats::fct_relevel(sex,  "Male")
  ) %>%
  dplyr::filter(!is.na(diabetes_dx), !is.na(age_c), !is.na(bmi_c),
                !is.na(sex), !is.na(race)) %>%
  droplevels()

stopifnot(all(is.finite(adult_imp1$wt_norm)))

glimpse(adult_imp1)
Rows: 5,592
Columns: 11
$ diabetes_dx <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ age         <dbl> 69, 54, 72, 73, 56, 61, 42, 56, 65, 26, 76, 33, 32, 38, 50…
$ bmi         <dbl> 26.7, 28.6, 28.9, 19.7, 41.7, 35.7, 23.6, 26.5, 22.0, 20.3…
$ sex         <fct> Male, Male, Male, Female, Male, Female, Male, Female, Male…
$ race        <fct> NH Black, NH White, NH White, NH White, Mexican American, …
$ WTMEC2YR    <dbl> 13481.04, 24471.77, 57193.29, 65541.87, 25344.99, 61758.65…
$ SDMVPSU     <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ SDMVSTRA    <dbl> 112, 108, 109, 116, 111, 114, 106, 112, 112, 113, 116, 114…
$ age_c       <dbl> 1.13241831, 0.27835981, 1.30323001, 1.36016725, 0.39223428…
$ bmi_c       <dbl> -0.33319172, -0.06755778, -0.02561558, -1.31184309, 1.7639…
$ wt_norm     <dbl> 0.3393916, 0.6160884, 1.4398681, 1.6500477, 0.6380722, 1.5…
Code
priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),
  set_prior("student_t(3, 0, 10)", class = "Intercept") 
)

bayes_fit <- brm(
  formula = diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race,
  data    = adult_imp1,
  family  = bernoulli(link = "logit"),
  prior   = priors,
  chains  = 4, iter = 2000, seed = 123,
  control = list(adapt_delta = 0.95),
  refresh = 0   # quiet Stan output
)
Code
prior_summary(bayes_fit)
               prior     class                coef group resp dpar nlpar lb ub
      normal(0, 2.5)         b                                                
      normal(0, 2.5)         b               age_c                            
      normal(0, 2.5)         b               bmi_c                            
      normal(0, 2.5)         b raceMexicanAmerican                            
      normal(0, 2.5)         b         raceNHBlack                            
      normal(0, 2.5)         b     raceOtherDMulti                            
      normal(0, 2.5)         b   raceOtherHispanic                            
      normal(0, 2.5)         b           sexFemale                            
 student_t(3, 0, 10) Intercept                                                
 tag       source
             user
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
             user
Code
summary(bayes_fit)            # Bayesian model summary
 Family: bernoulli 
  Links: mu = logit 
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race 
   Data: adult_imp1 (Number of observations: 5592) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -2.66      0.09    -2.84    -2.50 1.00     4187     3510
age_c                   1.09      0.06     0.97     1.22 1.00     3012     3098
bmi_c                   0.63      0.05     0.53     0.72 1.00     3472     3315
sexFemale              -0.66      0.10    -0.86    -0.46 1.00     4003     3052
raceMexicanAmerican     0.69      0.18     0.35     1.04 1.00     3526     2843
raceOtherHispanic       0.43      0.24    -0.07     0.89 1.00     4058     3114
raceNHBlack             0.54      0.15     0.24     0.82 1.00     3597     3177
raceOtherDMulti         0.82      0.19     0.45     1.19 1.00     3763     3257

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
# Class distribution

ggplot(adult_imp1, aes(x = factor(diabetes_dx))) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Diabetes Diagnosis Distribution",
    x = "Diabetes Diagnosis (0 = No, 1 = Yes)",
    y = "Count"
  ) +
  theme_minimal()

Code
prop.table(table(adult_imp1$diabetes_dx))

       0        1 
0.889485 0.110515 
Code
# Summaries of numeric predictors
numeric_vars <- sapply(adult_imp1, is.numeric)
summary(adult_imp1[, numeric_vars])
  diabetes_dx          age             bmi          WTMEC2YR     
 Min.   :0.0000   Min.   :20.00   Min.   :14.1   Min.   :     0  
 1st Qu.:0.0000   1st Qu.:34.00   1st Qu.:24.1   1st Qu.: 18785  
 Median :0.0000   Median :48.00   Median :27.7   Median : 27803  
 Mean   :0.1105   Mean   :48.84   Mean   :29.0   Mean   : 39779  
 3rd Qu.:0.0000   3rd Qu.:63.00   3rd Qu.:32.4   3rd Qu.: 52293  
 Max.   :1.0000   Max.   :80.00   Max.   :82.9   Max.   :171395  
    SDMVPSU         SDMVSTRA         age_c              bmi_c         
 Min.   :1.000   Min.   :104.0   Min.   :-1.65751   Min.   :-2.09476  
 1st Qu.:1.000   1st Qu.:107.0   1st Qu.:-0.86038   1st Qu.:-0.69669  
 Median :1.000   Median :111.0   Median :-0.06326   Median :-0.19338  
 Mean   :1.486   Mean   :110.9   Mean   :-0.01563   Mean   :-0.01133  
 3rd Qu.:2.000   3rd Qu.:115.0   3rd Qu.: 0.79079   3rd Qu.: 0.46371  
 Max.   :2.000   Max.   :118.0   Max.   : 1.75873   Max.   : 7.52398  
    wt_norm      
 Min.   :0.0000  
 1st Qu.:0.4729  
 Median :0.7000  
 Mean   :1.0014  
 3rd Qu.:1.3165  
 Max.   :4.3150  
Code
# Summaries of factor predictors
factor_vars <- sapply(adult_imp1, is.factor)
summary(adult_imp1[, factor_vars])
     sex                     race     
 Male  :2669   NH White        :2398  
 Female:2923   Mexican American: 742  
               Other Hispanic  : 489  
               NH Black        :1141  
               Other/Multi     : 822  
Code
# Save your dataset as CSV
write.csv(adult_imp1, "adult_imp1.csv", row.names = FALSE)

Training and testing

To evaluate how well the model learns patterns from the training data, and how reliably it would perform on new data in real-world scenarios, we split the imputed data into training and testing sets.

The testing set evaluates model performance on unseen data:

  • Model accuracy / performance
  • Generalizability: checks for overfitting or underfitting
  • Reliability of predictions: confidence in model predictions for new individuals
  • Distribution and bias check: ensures the training and testing sets represent the same population

Results from the training (n = 4473) and testing (n = 1119) sets: age ~48 yrs, BMI ~28, bmi_c ~0, wt_norm ~1; ranges are similar across sets.

Code

import pandas as pd

# Load the data exported from R
adult_data_py = pd.read_csv("adult_imp1.csv")  # your R dataset saved as CSV

# Now you can rename it freely in Python
df = adult_data_py  # just create a new variable pointing to the same DataFrame

# Check the data
df.head()
   diabetes_dx  age   bmi     sex  ... SDMVSTRA     age_c     bmi_c   wt_norm
0            1   69  26.7    Male  ...      112  1.132418 -0.333192  0.339392
1            1   54  28.6    Male  ...      108  0.278360 -0.067558  0.616088
2            1   72  28.9    Male  ...      109  1.303230 -0.025616  1.439868
3            0   73  19.7  Female  ...      116  1.360167 -1.311843  1.650048
4            0   56  41.7    Male  ...      111  0.392234  1.763918  0.638072

[5 rows x 11 columns]
Code
import pandas as pd
from sklearn.model_selection import train_test_split

# Optional: rename columns if needed
df = df.rename(columns={"diabetes_dx": "diabetes_status"})

# Separate predictors (X) and outcome (y)
X = df.drop(columns=["diabetes_status"])
y = df["diabetes_status"]

# Create 80/20 train-test split, stratifying by outcome to keep class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)

# Combine X and y if you want full train/test datasets
train_data = pd.concat([X_train, y_train], axis=1)
test_data  = pd.concat([X_test, y_test], axis=1)

# Check sizes
print("Training set:", train_data.shape)
Training set: (4473, 11)
Code
print("Testing set: ", test_data.shape)
Testing set:  (1119, 11)
Code
# Check class balance
print("\nTraining set class distribution:\n", y_train.value_counts(normalize=True))

Training set class distribution:
 diabetes_status
0    0.88956
1    0.11044
Name: proportion, dtype: float64
Code
print("\nTesting set class distribution:\n", y_test.value_counts(normalize=True))

Testing set class distribution:
 diabetes_status
0    0.889187
1    0.110813
Name: proportion, dtype: float64
Code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Combine train and test distributions into a DataFrame
train_dist = pd.DataFrame({'Class': y_train.unique(), 
                           'Proportion': y_train.value_counts(normalize=True).values,
                           'Set': 'Train'})

test_dist = pd.DataFrame({'Class': y_test.unique(), 
                          'Proportion': y_test.value_counts(normalize=True).values,
                          'Set': 'Test'})

dist_df = pd.concat([train_dist, test_dist])

# Plot
plt.figure(figsize=(8,5))
sns.barplot(x='Class', y='Proportion', hue='Set', data=dist_df)
plt.title('Class Distribution in Training and Testing Sets')
plt.ylabel('Proportion')
plt.xlabel('Diabetes Status')
plt.ylim(0,1)
(0.0, 1.0)
Code
plt.show()

Code
# Numeric summaries for predictors
print("Training set numeric summary:\n", X_train.describe())
Training set numeric summary:
                age          bmi  ...        bmi_c      wt_norm
count  4473.000000  4473.000000  ...  4473.000000  4473.000000
mean     48.926895    28.933736  ...    -0.020899     1.001234
std      17.571583     6.935321  ...     0.969609     0.786057
min      20.000000    14.100000  ...    -2.094764     0.000000
25%      34.000000    24.100000  ...    -0.696691     0.477152
50%      48.000000    27.700000  ...    -0.193384     0.700259
75%      63.000000    32.300000  ...     0.449729     1.286732
max      80.000000    77.500000  ...     6.769021     4.314957

[8 rows x 8 columns]
Code
print("\nTesting set numeric summary:\n", X_test.describe())

Testing set numeric summary:
                age          bmi  ...        bmi_c      wt_norm
count  1119.000000  1119.000000  ...  1119.000000  1119.000000
mean     48.475424    29.275782  ...     0.026921     1.002283
std      17.568193     7.744067  ...     1.082677     0.785208
min      20.000000    15.200000  ...    -1.940976     0.000000
25%      33.000000    23.900000  ...    -0.724652     0.453099
50%      47.000000    27.800000  ...    -0.179404     0.697037
75%      63.000000    32.900000  ...     0.533614     1.378142
max      80.000000    82.900000  ...     7.523981     3.877294

[8 rows x 8 columns]


Prior - For the intercept we used a student_t(3, 0, 10) prior R. V. D. Schoot et al. (2013); the coefficients received normal(0, 2.5) priors.

A Student's t prior, student_t(ν, μ, σ), has:

  • ν = 3: degrees of freedom (controls tail heaviness)
  • μ = 0: location (the center, like a mean)
  • σ = 10: scale (the spread, like a standard deviation)

student_t(3, 0, 10) is therefore centered at 0, allows large values (several multiples of 10), and has heavy tails (ν = 3) that accommodate outliers or unexpectedly large parameter values. This helps keep the model stable with small datasets, highly correlated predictors, or outliers in the data.
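A quick simulation illustrates the heavy tails: compared with a normal(0, 10) prior, a student_t(3, 0, 10) prior places far more mass beyond three scale units (sample size and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# student_t(3, 0, 10) vs normal(0, 10)
t_draws = 10 * rng.standard_t(df=3, size=n)
norm_draws = rng.normal(0, 10, size=n)

# Fraction of draws beyond +/- 3 scale units (|x| > 30)
t_tail = np.mean(np.abs(t_draws) > 30)
norm_tail = np.mean(np.abs(norm_draws) > 30)
print(t_tail, norm_tail)  # t(3) leaves roughly 20x more tail mass
```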

Bayesian Logistic Regression Analysis

Code
###
library(ggplot2)

# priors for two coefficients (age and bmi) prior = N(0,2.5)
prior_draws <- tibble(
  term = rep(c("Age (per 1 SD)", "BMI (per 1 SD)"), each = 4000),
  value = c(rnorm(4000, 0, 2.5), rnorm(4000, 0, 2.5))
)

ggplot(prior_draws, aes(x = value, fill = term)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Prior Distributions for Coefficients",
       x = "Coefficient Value", y = "Density") +
  scale_fill_manual(values = c("skyblue", "orange"))

Code
# Diabetes vs BMI

library(ggplot2)

# Create the plot
ggplot(adult_imp1, aes(x = factor(diabetes_dx), y = bmi, fill = factor(diabetes_dx))) +
  geom_boxplot(alpha = 0.7) +
  scale_x_discrete(labels = c("0" = "No Diabetes", "1" = "Diabetes")) +
  labs(
    x = "Diabetes Diagnosis",
    y = "BMI",
    title = "BMI Distribution by Diabetes Status"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Code
# logistic regression curve
ggplot(adult_imp1, aes(x = bmi, y = diabetes_dx)) +
  geom_point(aes(y = diabetes_dx), alpha = 0.2, position = position_jitter(height = 0.02)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE, color = "blue") +
  labs(
    x = "BMI",
    y = "Probability of Diabetes",
    title = "Predicted Probability of Diabetes vs BMI"
  ) +
  theme_minimal()

Posterior Draws

Once we have posterior draws, we compute summary statistics (mean, median, 95% credible intervals) via summary(bayes_fit), and odds ratios via posterior_summary(bayes_fit).
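The same summaries can be computed directly from a vector of draws. The example below uses hypothetical normal draws standing in for a posterior log-odds coefficient (loosely echoing bmi_c), not the actual bayes_fit output:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical draws for a log-odds coefficient (mean 0.63, sd 0.05);
# in practice this would be a column of as_draws_df(bayes_fit)
draws = rng.normal(loc=0.63, scale=0.05, size=4000)

summary = {
    "mean": draws.mean(),
    "median": np.median(draws),
    "ci_2.5": np.percentile(draws, 2.5),
    "ci_97.5": np.percentile(draws, 97.5),
    # Odds ratio: exponentiate draw by draw, then summarize
    "OR_median": np.median(np.exp(draws)),
}
print({k: round(float(v), 3) for k, v in summary.items()})
```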

Visualization

  • mcmc_hist(posterior, pars = c("b_age_c")) or mcmc_areas()
  • Pairwise plots: correlations between parameters via mcmc_pairs(posterior)
  • Posterior predictive checks (PPC) for model adequacy
  • Compare model predictions vs observed data: pp_check(bayes_fit)
  • bayes_R2 for model fit
  • PPC density plots, PPC stats (mean and SD), and PPC bar plots
Code
summary(bayes_fit)
 Family: bernoulli 
  Links: mu = logit 
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race 
   Data: adult_imp1 (Number of observations: 5592) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -2.66      0.09    -2.84    -2.50 1.00     4187     3510
age_c                   1.09      0.06     0.97     1.22 1.00     3012     3098
bmi_c                   0.63      0.05     0.53     0.72 1.00     3472     3315
sexFemale              -0.66      0.10    -0.86    -0.46 1.00     4003     3052
raceMexicanAmerican     0.69      0.18     0.35     1.04 1.00     3526     2843
raceOtherHispanic       0.43      0.24    -0.07     0.89 1.00     4058     3114
raceNHBlack             0.54      0.15     0.24     0.82 1.00     3597     3177
raceOtherDMulti         0.82      0.19     0.45     1.19 1.00     3763     3257

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
plot(bayes_fit)   # Posterior distributions

Code
pp_check(bayes_fit)      # Posterior predictive checks

Code
mcmc_trace(bayes_fit)    # Convergence (optional)

Code
bayes_R2(bayes_fit)      # Model fit
    Estimate  Est.Error      Q2.5    Q97.5
R2 0.1313342 0.01265055 0.1064607 0.156078

Estimated R²

  • Estimated R² = 0.13 (13.1%): the predictors are relevant, but most of the variability remains unexplained.
  • Other factors (genetics, lifestyle, environment, etc.) likely contribute.
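For a Bernoulli model, bayes_R2 follows Gelman et al.'s definition: per posterior draw, the variance of the fitted probabilities divided by that variance plus the residual variance mean(μ(1 − μ)). A sketch with hypothetical fitted draws (not the actual bayes_fit output):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior fitted probabilities: draws x observations,
# centered near the ~11% prevalence seen in the data
mu = rng.beta(2, 14, size=(1000, 500))

# Per draw s: R2_s = var(fit_s) / (var(fit_s) + residual variance),
# with Bernoulli residual variance mean(mu_s * (1 - mu_s))
var_fit = mu.var(axis=1)
var_res = (mu * (1.0 - mu)).mean(axis=1)
r2 = var_fit / (var_fit + var_res)

print(round(float(r2.mean()), 3),
      np.round(np.percentile(r2, [2.5, 97.5]), 3))
```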

Below, a correlation heatmap of the outcome and numeric predictors is shown, followed by the odds ratios derived from the Bayesian posterior.

Code
library(ggplot2)
library(reshape2)

correlation_matrix <- cor(adult_imp1[, c("diabetes_dx", "age", "bmi")], use = "complete.obs", method = "pearson")
correlation_melted <- melt(correlation_matrix)

ggplot(correlation_melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab", name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = "Features", y = "Features")

Code
# Posterior ORs (drop intercept, clean labels)

bayes_or <- posterior_summary(bayes_fit, pars = "^b_") %>%
  as.data.frame() %>%
  tibble::rownames_to_column("raw") %>%
  dplyr::mutate(
    term = gsub("^b_", "", raw),
    term = gsub("race", "race:", term),
    term = gsub("sex",  "sex:",  term),
    term = gsub("OtherDMulti", "Other/Multi", term),
    term = gsub("OtherHispanic", "Other Hispanic", term),
    OR   = exp(Estimate),
    LCL  = exp(Q2.5),
    UCL  = exp(Q97.5)
  ) %>%
  dplyr::select(term, OR, LCL, UCL) %>%
  dplyr::filter(term != "Intercept")

knitr::kable(
  bayes_or %>%
    dplyr::mutate(dplyr::across(c(OR,LCL,UCL), ~round(.x, 2))),
  digits = 2,
  caption = "Bayesian posterior odds ratios (95% CrI) — reference: NH White (race), Male (sex)"
)
Bayesian posterior odds ratios (95% CrI) — reference: NH White (race), Male (sex)
term OR LCL UCL
age_c 2.99 2.64 3.37
bmi_c 1.87 1.71 2.05
sex:Female 0.52 0.42 0.63
race:MexicanAmerican 2.00 1.41 2.84
race:Other Hispanic 1.54 0.93 2.43
race:NHBlack 1.71 1.28 2.27
race:Other/Multi 2.27 1.56 3.28
Code
# Combined table

if (!dir.exists("outputs")) dir.create("outputs", recursive = TRUE)
saveRDS(svy_fit,   "outputs/svy_fit.rds")
saveRDS(pool_mi,   "outputs/pool_mi.rds")
saveRDS(bayes_fit, "outputs/bayes_fit.rds")
saveRDS(svy_or,    "outputs/survey_OR_table.rds")
saveRDS(mi_or,     "outputs/mi_OR_table.rds")
saveRDS(bayes_or,  "outputs/bayes_OR_table.rds")
Code
# Results

 #Build compact results table (BMI & Age only) 
library(dplyr); 
library(tidyr); 
library(knitr); 
library(stringr)

# pretty "OR (LCL–UCL)" string

  
  fmt_or <- function(or, lcl, ucl, digits = 2) {
  paste0(
    formatC(or,  format = "f", digits = digits), " (",
    formatC(lcl, format = "f", digits = digits), "–",
    formatC(ucl, format = "f", digits = digits), ")"
  )
}

# guardrails: require these to exist from Modeling
stopifnot(exists("svy_or"), exists("mi_or"), exists("bayes_or"))
for (nm in c("svy_or","mi_or","bayes_or")) {
  if (!all(c("term","OR","LCL","UCL") %in% names(get(nm)))) {
    stop(nm, " must have columns: term, OR, LCL, UCL")
  }
}

svy_tbl   <- svy_or   %>% mutate(Model = "Survey-weighted MLE")
mi_tbl    <- mi_or    %>% mutate(Model = "mice pooled")
bayes_tbl <- bayes_or %>% mutate(Model = "Bayesian")

all_tbl <- bind_rows(svy_tbl, mi_tbl, bayes_tbl) %>%
  mutate(term = case_when(
    str_detect(term, "bmi_c|\\bBMI\\b") ~ "BMI (per 1 SD)",
    str_detect(term, "age_c|\\bAge\\b") ~ "Age (per 1 SD)",
    TRUE ~ term
  )) %>%
  filter(term %in% c("BMI (per 1 SD)", "Age (per 1 SD)")) %>%
  mutate(OR_CI = fmt_or(OR, LCL, UCL, digits = 2)) %>%
  select(Model, term, OR_CI) %>%
  arrange(
    factor(Model, levels = c("Survey-weighted MLE","mice pooled","Bayesian")),
    factor(term,  levels = c("BMI (per 1 SD)","Age (per 1 SD)"))
  )

res_wide <- all_tbl %>%
  pivot_wider(names_from = term, values_from = OR_CI) %>%
  rename(
    `BMI (per 1 SD) OR (95% CI)` = `BMI (per 1 SD)`,
    `Age (per 1 SD) OR (95% CI)` = `Age (per 1 SD)`
  )

kable(
  res_wide,
  align = c("l","c","c"),
  caption = "Odds ratios (per 1 SD) with 95% CIs across models"
)
Odds ratios (per 1 SD) with 95% CIs across models
Model BMI (per 1 SD) OR (95% CI) Age (per 1 SD) OR (95% CI)
Survey-weighted MLE 1.89 (1.65–2.15) 3.03 (2.70–3.40)
mice pooled 1.73 (1.58–1.89) 2.90 (2.60–3.24)
Bayesian 1.87 (1.71–2.05) 2.99 (2.64–3.37)
Code
# Posterior predictive draws

#Posterior predictive checks (binary outcome)
pp_samples <- posterior_predict(bayes_fit, ndraws = 500)  # 500 draws

# Check dimensions
dim(pp_samples)  # rows = draws, cols = observations
[1]  500 5592
Code
# Overlay of observed vs posterior-replicated outcome densities
ppc_dens_overlay(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:50, ]) +
  labs(title = "Posterior Predictive Check: Density Overlay") +
  theme_minimal()

Code
ppc_bars(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:50, ])

Code
#PP check for proportions (useful for binary) # mean comparison
## to check if the simulated means match the observed mean

## mean
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "mean") +
  labs(title = "Posterior Predictive Check: Mean of Replicates") +
  theme_minimal()

Code
## sd
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "sd") +
  labs(title = "PPC: Standard Deviation of Replicates") +
  theme_minimal()

Comparative Visualizations

  • Predicted vs observed: checks how well the model's predictions align with reality, where mean(y_rep) is the average predicted probability of diabetes for each individual across all posterior draws, and y is the observed diabetes status (0 = non-diabetic, 1 = diabetic).
  • MCMC density plots comparing prior and posterior parameter estimates for bmi_c, age_c, sexFemale, and the race categories
  • Fitted (predicted) vs observed BMI using points and error bars
  • Fitted (predicted) vs observed BMI using a line plot
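The predicted-vs-observed comparison can also be checked numerically via calibration by risk decile: bin individuals by predicted risk and compare the average prediction with the observed event rate in each bin. The sketch below uses simulated predictions and outcomes, not the model's actual y_rep:

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical predicted probabilities and simulated observed outcomes
# (stand-ins for mean(y_rep) per person and observed diabetes_dx)
pred = rng.beta(2, 14, size=5000)
obs = rng.binomial(1, pred)

# Bin by predicted-risk decile; compare average prediction with the
# observed event rate in each bin
edges = np.quantile(pred, np.linspace(0, 1, 11))
bins = np.clip(np.digitize(pred, edges[1:-1]), 0, 9)
pred_means = np.array([pred[bins == b].mean() for b in range(10)])
obs_means = np.array([obs[bins == b].mean() for b in range(10)])
print(np.round(pred_means, 3))
print(np.round(obs_means, 3))
```

For a well-calibrated model the two rows track each other closely in every decile.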
Code
# Combine observed and predicted into one data frame
# Posterior fitted summaries; columns: Estimate, Est.Error, Q2.5, Q97.5
# (fitted() returns the predicted probability of diabetes)
predicted <- fitted(bayes_fit)

plot_data <- adult_imp1 %>%
  mutate(
    predicted_bmi = predicted[, "Estimate"],
    lower_ci = predicted[, "Q2.5"],
    upper_ci = predicted[, "Q97.5"],
    obs_index = 1:nrow(adult_imp1)  # index for x-axis
  )

# Line plot
ggplot(plot_data, aes(x = obs_index)) +
  geom_line(aes(y = bmi, color = "Observed")) +               # observed BMI
  geom_line(aes(y = predicted_bmi, color = "Predicted")) +   # predicted BMI
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci), alpha = 0.2) +  # uncertainty
  labs(x = "Observation", y = "BMI", color = "Legend") +
  theme_minimal()

Code
summary(adult_imp1$bmi)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   14.1    24.1    27.7    29.0    32.4    82.9 
Code
summary(plot_data$bmi_c)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.09476 -0.69669 -0.19338 -0.01133  0.46371  7.52398 

Centering for bmi:

  • Summary of the original bmi (observed data) and the centered version of BMI.
  • Centering doesn't change the distribution's shape; it only shifts it so the mean is zero.
  • Centering is useful in regression/Bayesian models to improve numerical stability and the interpretability of intercepts.
  • Plots below show predicted values for bmi and age (prior vs posterior) and the proportion of diabetes = 1 for each draw.
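The claim that centering/scaling shifts the distribution without changing its shape can be verified directly; the sketch below uses simulated right-skewed "BMI" values (hypothetical, not the NHANES data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated right-skewed "BMI" values (roughly mean 29, sd 6)
bmi = rng.gamma(shape=6.0, scale=2.5, size=5000) + 14.0
bmi_c = (bmi - bmi.mean()) / bmi.std()   # centered and scaled, as in the model

# Standardizing shifts/rescales but preserves shape: the standardized
# third moment (skewness) is identical before and after
skew = lambda x: float(np.mean(((x - x.mean()) / x.std()) ** 3))
print(round(float(bmi_c.mean()), 6), round(float(bmi_c.std()), 3))  # ~0 and 1
print(round(skew(bmi), 3), round(skew(bmi_c), 3))                   # equal
```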

Code
prior_summary(bayes_fit)
               prior     class                coef group resp dpar nlpar lb ub
      normal(0, 2.5)         b                                                
      normal(0, 2.5)         b               age_c                            
      normal(0, 2.5)         b               bmi_c                            
      normal(0, 2.5)         b raceMexicanAmerican                            
      normal(0, 2.5)         b         raceNHBlack                            
      normal(0, 2.5)         b     raceOtherDMulti                            
      normal(0, 2.5)         b   raceOtherHispanic                            
      normal(0, 2.5)         b           sexFemale                            
 student_t(3, 0, 10) Intercept                                                
 tag       source
             user
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
     (vectorized)
             user
Code
# NB: these illustration draws use sd = 1; the model's coefficient
# prior is normal(0, 2.5) (see prior_summary above)
prior_draws <- tibble(
  term = rep(c("BMI (per 1 SD)", "Age (per 1 SD)"), each = 4000),
  estimate = c(rnorm(4000, 0, 1), rnorm(4000, 0, 1)),
  type = "Prior"
)

summary(prior_draws)
     term              estimate             type          
 Length:8000        Min.   :-3.585220   Length:8000       
 Class :character   1st Qu.:-0.697054   Class :character  
 Mode  :character   Median : 0.000046   Mode  :character  
                    Mean   :-0.007350                     
                    3rd Qu.: 0.668561                     
                    Max.   : 3.922817                     
Code
post
# A draws_df: 1000 iterations, 4 chains, and 11 variables
   b_Intercept b_age_c b_bmi_c b_sexFemale b_raceMexicanAmerican
1         -2.6     1.1    0.70       -0.71                  0.67
2         -2.7     1.0    0.62       -0.57                  0.65
3         -2.6     1.1    0.65       -0.76                  0.63
4         -2.7     1.0    0.65       -0.67                  0.82
5         -2.6     1.1    0.61       -0.73                  0.75
6         -2.5     1.0    0.60       -0.77                  0.61
7         -2.8     1.1    0.66       -0.66                  0.52
8         -2.8     1.2    0.67       -0.57                  0.94
9         -2.8     1.1    0.65       -0.52                  0.84
10        -2.6     1.1    0.67       -0.85                  0.70
   b_raceOtherHispanic b_raceNHBlack b_raceOtherDMulti
1                0.605          0.52              0.95
2                0.338          0.45              0.69
3                0.566          0.63              0.54
4                0.453          0.61              0.78
5                0.090          0.50              0.62
6                0.015          0.48              0.60
7                0.736          0.50              0.84
8                0.913          0.57              1.07
9                0.570          0.66              0.81
10               0.467          0.54              0.97
# ... with 3990 more draws, and 3 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
Code
head(post)
# A draws_df: 6 iterations, 1 chains, and 11 variables
  b_Intercept b_age_c b_bmi_c b_sexFemale b_raceMexicanAmerican
1        -2.6     1.1    0.70       -0.71                  0.67
2        -2.7     1.0    0.62       -0.57                  0.65
3        -2.6     1.1    0.65       -0.76                  0.63
4        -2.7     1.0    0.65       -0.67                  0.82
5        -2.6     1.1    0.61       -0.73                  0.75
6        -2.5     1.0    0.60       -0.77                  0.61
  b_raceOtherHispanic b_raceNHBlack b_raceOtherDMulti
1               0.605          0.52              0.95
2               0.338          0.45              0.69
3               0.566          0.63              0.54
4               0.453          0.61              0.78
5               0.090          0.50              0.62
6               0.015          0.48              0.60
# ... with 3 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
Code
names(prior_draws)
[1] "term"     "estimate" "type"    
Code
library(brms)
library(tidyr)

# Extract posterior draws
post <- as_draws_df(bayes_fit) %>%      # bayes_fit = your brms model
  select(b_bmi_c, b_age_c) %>%               # select your coefficient columns
  pivot_longer(
    everything(),
    names_to = "term",
    values_to = "estimate"
  ) %>%
  mutate(
    term = case_when(
      term == "b_bmi_c" ~ "BMI (per 1 SD)",
      term == "b_age_c" ~ "Age (per 1 SD)"
    ),
    type = "Posterior"
  )

## visualization of prior and predicted draws
combined_draws <- bind_rows(prior_draws, post) 

library(ggplot2)

ggplot(combined_draws, aes(x = estimate, fill = type)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ term, scales = "free", ncol = 2) +
  theme_minimal(base_size = 13) +
  labs(
    title = "Prior vs Posterior Distributions",
    x = "Coefficient estimate",
    y = "Density",
    fill = ""
  )

Code
# Compute proportion of diabetes=1 for each draw
pp_proportion <- rowMeans(pp_samples)  # proportion of 1's in each posterior draw

# Summary of posterior proportions
summary(pp_proportion)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09067 0.10533 0.10944 0.10935 0.11320 0.12715 
Code
# Optional: visualize the posterior probability distribution
pp_proportion_df <- tibble(proportion = pp_proportion)

ggplot(pp_proportion_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
  labs(
    title = "Posterior Distribution of Proportion of Diabetes = 1",
    x = "Proportion of Diabetes = 1",
    y = "Frequency"
  ) +
  theme_minimal()

  • Prior and predicted draws
  • Posterior predicted proportion of Diabetes vs NHANES prevalence of Diabetes in the population
Code
library(tidyverse)

# Posterior predicted proportion vector
# pp_proportion <- rowMeans(pp_samples)  # if not already done

known_prev <- 0.089   # NHANES prevalence

# Posterior summary
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# Create a data frame for plotting
pp_df <- tibble(proportion = pp_proportion)

# Plot
ggplot(pp_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.005, fill = "skyblue", color = "black") +
  geom_vline(xintercept = known_prev, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = posterior_mean, color = "blue", linetype = "solid", size = 1) +
  geom_rect(aes(xmin = posterior_ci[1], xmax = posterior_ci[2], ymin = 0, ymax = Inf),
            fill = "blue", alpha = 0.1, inherit.aes = FALSE) +
  labs(
    title = "Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    subtitle = paste0("Red dashed = NHANES prevalence (", known_prev, 
                      "), Blue solid = Posterior mean (", round(posterior_mean,3), ")"),
    x = "Proportion of Diabetes = 1",
    y = "Frequency"
  ) +
  theme_minimal()

Code
library(dplyr)

# Posterior predicted proportion
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# NHANES prevalence with SE from survey::svymean
# Suppose you already have:
# svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
known_prev <- 0.089        # Mean prevalence
known_se   <- 0.0048       # Standard error from survey

# Calculate 95% confidence interval
known_ci <- c(
  known_prev - 1.96 * known_se,
  known_prev + 1.96 * known_se
)

# Print results
data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(posterior_mean, known_prev),
  Lower_95 = c(posterior_ci[1], known_ci[1]),
  Upper_95 = c(posterior_ci[2], known_ci[2])
)
                     Type      Mean   Lower_95  Upper_95
2.5% Posterior Prediction 0.1093519 0.09781831 0.1212446
        NHANES Prevalence 0.0890000 0.07959200 0.0984080
Code
library(ggplot2)
library(dplyr)

# Create a data frame for plotting
ci_df <- data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(0.1096674, 0.089),
  Lower_95 = c(0.09772443, 0.079592),
  Upper_95 = c(0.1210658, 0.098408)
)

# Plot
ggplot(ci_df, aes(x = Type, y = Mean, color = Type)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = Lower_95, ymax = Upper_95), width = 0.2) +
  ylim(0, max(ci_df$Upper_95) + 0.02) +
  labs(
    title = "Comparison of Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    y = "Proportion of Diabetes",
    x = ""
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

Results

Multiple Logistic Regression Model (Survey-Weighted)

Model:
`svyglm(formula = form_cc, design = des_cc, family = quasibinomial())`

All predictors are significant: age (p < 0.001) and BMI (p < 0.001) show strong positive associations with the outcome, while being female (p = 0.0004) is negatively associated. Other significant associations include raceMexican American (p = 0.0008), raceOther Hispanic (p = 0.0087), raceNH Black (p = 0.0117), and raceOther/Multi (p = 0.0014).

Multivariate Imputation by Chained Equations

All predictors are statistically significant.

Positive associations: age (p < 0.001), BMI (p < 0.001), and all
race categories compared to the reference.

Negative association: being female (p < 0.001)

Bayesian Logistic Regression

Sampling: NUTS (4 chains, 2000 iterations each; 1000 warmup,
4000 post-warmup draws)

Convergence & Diagnostics

  • Rhat = 1.00 for all parameters → excellent convergence
  • Bulk_ESS / Tail_ESS: Large values (>2000) → high effective sample sizes, reliable posterior estimates.
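These diagnostics can be pulled directly from the fitted object with the posterior package; a minimal sketch (assuming the fitted brms model is named `bayes_fit`, as elsewhere in this report):

```r
library(posterior)

# Per-parameter convergence diagnostics from the fitted model
diag <- summarise_draws(as_draws_df(bayes_fit), "rhat", "ess_bulk", "ess_tail")

# Flag any parameter that fails the usual thresholds
subset(diag, rhat > 1.01 | ess_bulk < 400 | ess_tail < 400)
```

An empty result from the last line confirms that every parameter meets the conventional Rhat and effective-sample-size cutoffs.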

Interpretation

  • Strong predictors: Age and BMI are strongly positively associated with diabetes risk.
  • Sex effect: Females have a lower probability of diabetes compared to males
  • R² = 0.13: about 13% of the variance in the outcome (diabetes_dx) is explained by the predictors (age, BMI, sex, race). The 95% credible interval (0.106–0.156) indicates that, given the model and data, the true proportion of variance explained plausibly lies between 10.6% and 15.6%, quantifying uncertainty in the explained variance, which is natural for probabilistic models.
  • Est.Error = 0.0127 reflects the standard deviation of the R² estimate across posterior samples; the small value indicates the R² estimate is fairly precise.
  • Race/ethnicity: Mexican American, NH Black, and “Other/Diverse” groups have higher odds of diabetes. Other Hispanic group has a less certain effect.
  • age_c: each 1-unit increase in centered age increases the log-odds of diabetes by 1.09, a strong positive effect.
  • bmi_c: higher BMI is associated with higher diabetes risk.
  • sexFemale: females have lower odds of diabetes compared to males.
  • raceMexicanAmerican: higher odds of diabetes vs. the reference race (likely NH White).
  • raceOtherHispanic: slightly higher odds vs. the reference, but the interval crosses zero → uncertain effect.
  • raceNHBlack: significantly higher odds of diabetes compared to the reference.
  • raceOtherDMulti: higher odds of diabetes vs. the reference group.
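The R² figures above can be reproduced from the fitted model with the brms Bayes R² method; a minimal sketch (assuming the fit is named `bayes_fit`):

```r
library(brms)

# Posterior summary of explained variance: Estimate, Est.Error, Q2.5, Q97.5
bayes_R2(bayes_fit)

# Full posterior of R2, e.g. for plotting or custom quantiles
r2_draws <- bayes_R2(bayes_fit, summary = FALSE)
quantile(r2_draws, c(0.025, 0.5, 0.975))
```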

Posterior distributions of all model parameters: the density plot smooths the posterior samples of each parameter (e.g., intercept, slopes) into a curve showing where most of the posterior probability mass lies, giving both the best estimates and their uncertainty.

Posterior predictive distribution: generated from posterior predictive draws, y_rep ∼ p(y_new | θ), which simulate new data given the posterior parameter estimates. Posterior predictive checks (PPC) compare these simulations to the real data to assess model fit.
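A posterior predictive check is a one-liner in brms; a minimal sketch (assuming the fit is named `bayes_fit`):

```r
library(brms)

# Overlay 100 simulated datasets on the observed outcome distribution
pp_check(bayes_fit, ndraws = 100)

# For a binary outcome, a bar plot of 0/1 counts is often clearer
pp_check(bayes_fit, type = "bars", ndraws = 100)
```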

The model incorporates two sources of uncertainty: parameter uncertainty, captured in the posterior distributions, and predictive uncertainty, captured in the posterior predictive draws.

Combining the two yields credible intervals for predictions rather than point predictions alone, supporting statements such as: given this BMI, the probability of diabetes is likely between 40% and 55%.

Autocorrelation interpretation: the autocorrelation plot shows the correlation of each sample with previous samples (lags) in the MCMC chain. Lag indicates the number of steps between samples; lag 0 is always 1 (perfect correlation with itself). Each chain (1–4) is shown separately for b_age_c (age coefficient) and b_bmi_c (BMI coefficient). Autocorrelation drops quickly to near zero after lag 1–2 for both coefficients in all chains, suggesting good mixing: successive samples are mostly independent after a short lag. The absence of persistent high autocorrelation indicates the MCMC chains are converging well. Low autocorrelation, as seen here, is desirable because it means the posterior estimates are reliable and not biased by chain dependence.
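The autocorrelation plot described above can be drawn with bayesplot; a minimal sketch (assuming the fit is named `bayes_fit`):

```r
library(bayesplot)
library(brms)

# Per-chain autocorrelation for the age and BMI coefficients
mcmc_acf(as.array(bayes_fit), pars = c("b_age_c", "b_bmi_c"), lags = 10)
```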

Comparing Models

  • All three models (survey-weighted MLE, multiple imputation, Bayesian) agree closely on the direction and magnitude of the effects of BMI and age.
  • Age is a stronger predictor than BMI, nearly tripling the odds per 1 SD.
  • BMI significantly increases diabetes risk (~1.7–1.9× per 1 SD).
  • Differences between models are minor, indicating robust and reliable findings despite missing data or modeling approach.
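The "nearly tripling" figure follows from exponentiating the log-odds coefficients; a minimal sketch (assuming the brms fit is named `bayes_fit`):

```r
library(brms)

# Posterior draws of the coefficients on the log-odds scale
post <- as_draws_df(bayes_fit)

# Odds ratio per 1 SD with a 95% credible interval
or_summary <- function(draws) {
  round(quantile(exp(draws), c(0.025, 0.5, 0.975)), 2)
}

or_summary(post$b_age_c)  # exp(1.09) is about 2.97: odds nearly triple per SD
or_summary(post$b_bmi_c)  # roughly 1.7-1.9x per SD, per the comparison above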

Predicted diabetes proportion vs. population prevalence: the posterior predicted proportion is slightly higher than the observed NHANES prevalence (10.97% vs. 8.9%). The intervals overlap only slightly; the posterior tends to overestimate diabetes compared to NHANES.

The difference could be due to:

  • model assumptions (logistic link, priors)
  • predictor effects (BMI, age, sex, race)
  • sample characteristics vs. population weighting in NHANES

Conclusion

The results indicate that the model is reasonable but slightly conservative, predicting somewhat higher risk than the observed population prevalence.

Across multiple modeling approaches (survey-weighted maximum likelihood, multiple imputation, and Bayesian regression), both age and BMI were consistently strong predictors of diabetes. Each standard deviation increase in age nearly tripled the odds of diabetes, while a similar increase in BMI elevated the odds by approximately 1.7–1.9 times. The consistency of these results across models highlights the robustness of the associations and underscores the importance of age and BMI as key risk factors for diabetes in this population.

Effect stability: point estimates in the Bayesian model closely aligned with those from the frequentist models, indicating that prior regularization stabilized the estimates in the presence of modest missingness.

Uncertainty quantification: Bayesian credible intervals for the odds ratios were slightly narrower yet overlapped the frequentist confidence intervals, suggesting comparable inferential precision while offering improved interpretability.

Design considerations:

  • Survey-weighted MLE (maximum likelihood estimation): each observation is weighted according to its survey weight, so the estimates reflect population-level parameters rather than just the sample, producing population-representative estimates.
  • Bayesian model with normalized weights: instead of fully modeling the survey design, it used normalized sampling weights as importance weights. Scaling the weights to sum to the sample size approximates the effect of survey weights but does not fully account for stratification, clustering, or design-based variance adjustments. Bayesian inference treats the weighted likelihood as coming from a regular model, ignoring some survey design features.
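The normalized-weight approach can be sketched as follows; this is a sketch under stated assumptions (the analytic data frame is `adult`, the NHANES weight column is assumed to be named `wt`, and the variable names match the model reported above):

```r
library(brms)

# Normalize survey weights so they sum to the sample size
adult$w_norm <- adult$wt * nrow(adult) / sum(adult$wt)

# Weighted Bayesian logistic regression; weights enter the likelihood as
# importance weights (this does NOT model stratification or clustering)
bayes_fit_w <- brm(
  diabetes_dx | weights(w_norm) ~ age_c + bmi_c + sex + race,
  data = adult, family = bernoulli(),
  chains = 4, iter = 2000, seed = 123
)
```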

Discussions

The use of multiple imputation allowed for robust analysis despite missing data, increasing power and reducing bias. Comparison of frequentist and Bayesian models demonstrated consistency in the significant predictors, while the Bayesian approach provided the advantage of posterior distributions and probabilistic interpretation. Across all models, both age and BMI emerged as strong and consistent predictors of diabetes, and this consistency across modeling approaches strengthens the validity of the findings. Multiple imputation accounted for potential biases due to missing data, and Bayesian modeling provided robust credible intervals that closely matched the frequentist estimates. These results align with previous epidemiological research indicating that increasing age and higher BMI are among the most important determinants of type 2 diabetes risk, reflecting cumulative exposure to metabolic and lifestyle risk factors over time and the role of excess adiposity and insulin-related effects. The survey-weighted dataset ensured population representativeness, multiple imputation handled missing data, and rigorous Bayesian estimation yielded high effective sample sizes, with R̂ ≈ 1.00 across parameters confirming excellent model convergence. Bayesian logistic regression thus provided statistically consistent and interpretable inference, achieving the aim of this study. Future research could fit a hierarchical model across NHANES cycles or add variables (e.g., lab tests) to assess nonlinear effects of metabolic risk factors.

Limitations

  1. The study is based on cross-sectional/observational NHANES data, which limits the ability to make causal inferences. Associations observed between BMI, age, diabetes status cannot confirm causation.
  2. The project relies on multiple imputation for missing values. Although imputation reduces bias, it assumes data are missing at random (MAR); if data are missing not at random (MNAR), results may be biased.
  3. Potential Residual Confounding - Models included key predictors (age, BMI, sex, race), but unmeasured factors like diet, physical activity, socioeconomic status, or genetic predisposition could confound associations.
  4. Bayesian models depend on prior choices, which could influence posterior estimates if priors are informative or mis-specified.
  5. Variable Measurement - BMI is measured at a single time point, which may not reflect long-term exposure or risk.
  6. Self-reported variables are subject to recall or reporting bias.
  7. Interactions (e.g., BMI × age) were not tested in this study, so potential synergistic effects might be missed.
  8. Predicted probabilities are model-based estimates, not validated for clinical decision-making. External validation in independent cohorts is needed before application.

QUESTION for targeted therapy

Translational Perspective from the Bayesian Diabetes Prediction Project. This project further demonstrates the translational potential of Bayesian modeling in clinical decision-making and public health strategy. By using patient-level predictors such as age, BMI, sex, and race to estimate the probability of diabetes, the model moves beyond descriptive statistics toward individualized risk prediction. The translational move lies in converting these probabilistic outputs into actionable thresholds—such as identifying the BMI or age at which the predicted risk of diabetes exceeds a clinically meaningful level (e.g., 30%). Such insights can guide early screening, personalized lifestyle interventions, and targeted prevention programs for populations at higher risk. This approach embodies precision public health—bridging data science and medical decision-making to deliver tailored, evidence-based strategies that can ultimately improve diabetes prevention and management outcomes.

What changes in modifiable predictors would lower diabetes risk?

Translational research implications:

  • We now use the model to guide prevention or intervention.
  • Only BMI is a modifiable risk factor here.
  • What must change in BMI, behavior, or lifestyle to achieve a lower risk threshold? In practice, we hold non-modifiable predictors constant (sex, race) and vary the modifiable predictor (BMI) until the model predicts the desired probability.

Below we use the data of the first participant from the same dataset (internal validation, since the data come from the same dataset). Next we use fake data for a new participant not in the dataset (external validation, since the data are out-of-sample). This means applying the trained model to a specific individual's data, using posterior draws to estimate predictive uncertainty (a credible interval), and testing generalizability: how well the model performs on new, independent data.

  • this approach bridges individual-level predictions with population-level evidence.
  • predicted probabilities of diabetes for an individual participant obtained using the fitted Bayesian logistic regression model, accounted for model uncertainty and a 95% credible interval represents the range within which the true probability of diabetes for the participant is expected to lie.
  • using this approach researchers can better understand the uncertainty and variability in clinical risk predictions, allowing for more personalized and data-informed decision-making, helping translate statistical models into clinically actionable insights.
  • such probabilistic predictions can help identify individuals at higher risk and can guide targeted screening, preventive interventions, and policy decisions, ultimately supporting precision medicine and improving patient outcomes.
Code
# Use the first participant 
# using multiple covariates to select someone
participant1_data  <- adult[1, ]


# predicted probabilities for patient 1
phat1 <- posterior_linpred(bayes_fit, newdata = participant1_data, transform = TRUE)
# 'transform = TRUE' gives probabilities for logistic regression

# Store in a data frame for plotting
post_pred_df <- data.frame(pred = phat1)

# Compute 95% credible interval
ci_95_participant1 <- quantile(phat1, c(0.025, 0.975))

# Plot

ggplot(post_pred_df, aes(x = pred)) + 
  geom_density(color='darkblue', fill='lightblue') +
  geom_vline(xintercept = ci_95_participant1[1], color='red', linetype='dashed') +
  geom_vline(xintercept = ci_95_participant1[2], color='red', linetype='dashed') +
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
  theme_bw()

Code
library(ggplot2)


# NOTE: age_c and bmi_c are *centered* covariates; supplying raw values
# (40, 25) places this profile far outside the centered scale, which is
# why every posterior draw below predicts probability 1. Centered values
# (e.g., age minus the sample mean age) should be used instead.
new_participant <- data.frame(
  age_c = 40,
  bmi_c = 25,
  sex   = "Female",
  race  = "Mexican American"
)

# Posterior predicted probabilities
phat_new <- posterior_linpred(bayes_fit, newdata = new_participant, transform = TRUE)

# Convert to numeric vector
phat_vec <- as.numeric(phat_new)

# Check the range to see if all values are similar
range(phat_vec)
[1] 1 1
Code
# Store in a data frame
post_pred_df_new <- data.frame(pred = phat_vec)

# Compute 95% credible interval from the vector
ci_95_new_participant <- quantile(phat_vec, c(0.025, 0.975))

# Plot
ggplot(post_pred_df_new, aes(x = pred)) + 
  geom_density(color='darkblue', fill='lightblue', alpha = 0.6) +
  geom_vline(xintercept = ci_95_new_participant[1], color='red', linetype='dashed') +
  geom_vline(xintercept = ci_95_new_participant[2], color='red', linetype='dashed') +
  xlim(0, 1) +  # ensures you see the curve even if values are close
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution (95% Credible Interval)') +
  theme_bw()

Targeted BMI

In this analysis, a grid of body mass index (BMI) values was generated to examine the relationship between BMI and the predicted probability of diabetes while holding the other covariates (age, sex, and race) constant. Using the fitted Bayesian logistic regression model, posterior predictive probabilities were computed for each BMI value, and the mean of the posterior draws estimated the average predicted probability of diabetes across the BMI range. We then identified the BMI value whose predicted probability was closest to a predefined target probability (here, 0.3). This enables a clinically meaningful interpretation of the model: determining the BMI level associated with a 30% predicted probability of diabetes for a given demographic profile, and linking statistical inference to practical health thresholds relevant for risk communication and prevention strategies in translational research.
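The grid search described above can be sketched as follows; this is a sketch under stated assumptions (the fit is named `bayes_fit`, covariates are on the centered scale used in the model, and the fixed demographic profile below is hypothetical):

```r
library(brms)

# Grid of centered BMI values for a fixed (hypothetical) profile
grid <- data.frame(
  bmi_c = seq(-2, 3, by = 0.1),  # centered BMI grid
  age_c = 0,                     # held at the mean age
  sex   = "Female",
  race  = "Mexican American"
)

# Mean posterior predicted probability at each BMI value
phat <- posterior_epred(bayes_fit, newdata = grid)
grid$p_mean <- colMeans(phat)

# BMI value whose predicted probability is closest to the 30% target
target <- 0.30
grid[which.min(abs(grid$p_mean - target)), c("bmi_c", "p_mean")]
```

The returned centered BMI can then be converted back to the raw scale by adding the sample mean BMI used for centering.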

Implications

  • Age and BMI are robust and independent predictors of diabetes, underscoring the importance of early targeted interventions in mitigating diabetes risk.
  • Longitudinal studies, and combining other statistical methods with Bayesian analysis, can further enhance and better inform precision prevention strategies.

References

Austin, Peter C., Ian R. White, Douglas S. Lee, and Stef van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
National Center for Health Statistics. 1999. “Vital and Health Statistics Reports Series 2, Number 161, September 2013.” National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/DIAGNOSTICS13193135,.
D’Angelo, Gina, and Di Ran. 2025. “Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space.” Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A tutorial on Bayesian Normal linear regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Leeuw, Christiaan de, and Irene Klugkist. 2012. “Augmenting Data With Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Yi Ming, Sam Li Sheng Chen, Amy Ming Fang Yen, and Hsiu Hsi Chen. 2013. “Individual risk prediction model for incident cardiovascular disease: A Bayesian clinical reasoning approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/J.IJCARD.2012.05.016.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian statistics and modelling.” Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Schoot, Rens Van De, David Kaplan, Jaap Denissen, Jens B Asendorpf, and Marcel A G Van Aken. 2013. “A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research.” https://doi.org/10.1111/cdev.12169.
Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Vol. 10. Boca Raton, FL: CRC Press.
Zeger, Scott L, Zhenke Wu, Yates Coley, Anthony Todd Fojo, Bal Carter, Katherine O’Brien, Peter Zandi, et al. 2020. “Using a Bayesian Approach to Predict Patients’ Health and Response to Treatment,” no. 2020. http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=reference&D=medp&NEWS=N&AN=37708307.